Overview

Principal Component Analysis (PCA) is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while preserving the information contained in the data. High-dimensional data often reveals clusters in a lower-dimensional view.

Clustering analysis is another form of EDA. Here we hope to group data points so that points within a group are close to each other and points in different groups are far apart. Clustering on the PCs can be effective. Clustering analysis can be quite subjective in how we summarize the properties within each group.

Both PCA and clustering analysis are so-called unsupervised learning methods: no response variable is involved in the process.

For supervised learning, we try to find out how a set of predictors relates to a response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the factors.

0.1 Objectives

  • PCA
  • SVD
  • Clustering Analysis
  • Linear Regression

0.2 Review materials

  • Study Module 2: PCA
  • Study Module 3: Clustering Analysis
  • Study lecture 4: Multiple regression

1 Case study 1: Self-esteem

Self-esteem generally describes a person's overall sense of self-worth and personal value. It can play a significant role in one's motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.

The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year records have been gathered through surveys. The survey data is open to the public. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.

The data is stored in NLSY79.csv.

Here are the descriptions of the variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 1987 and 2005 respectively
  • Job87, Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78: family income in 1978

Variables Related to ASVAB test Scores in 1981

  • AFQT: percentile score on the AFQT intelligence test in 1981
  • Coding: score on the Coding Speed test in 1981
  • Auto: score on the Automotive and Shop test in 1981
  • Mechanic: score on the Mechanic test in 1981
  • Elec: score on the Electronics Information test in 1981
  • Science: score on the General Science test in 1981
  • Math: score on the Math test in 1981
  • Arith: score on the Arithmetic Reasoning test in 1981
  • Word: score on the Word Knowledge Test in 1981
  • Parag: score on the Paragraph Comprehension test in 1981
  • Numer: score on the Numerical Operations test in 1981

Self-Esteem Tests in 1981 and 1987

We have two sets of self-esteem tests, one administered in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is esteem question 1 in 1981.

The following 10 questions are answered on a four-point scale (1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree):

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

Student Answer:

NLSY79<-read.csv("./data/NLSY79.csv")
str(NLSY79) # data format
## 'data.frame':    2431 obs. of  46 variables:
##  $ Subject       : int  2 6 7 8 9 13 16 17 18 20 ...
##  $ Gender        : chr  "female" "male" "male" "female" ...
##  $ Education05   : int  12 16 12 14 14 16 13 13 13 17 ...
##  $ Income87      : int  16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
##  $ Job05         : chr  "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
##  $ Income05      : int  5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
##  $ Weight05      : int  160 187 175 246 180 235 160 188 173 130 ...
##  $ HeightFeet05  : int  5 5 5 5 5 6 5 5 5 5 ...
##  $ HeightInch05  : int  2 5 9 3 6 0 4 10 9 4 ...
##  $ Imagazine     : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Inewspaper    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Ilibrary      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MotherEd      : int  5 12 12 9 12 12 12 12 12 12 ...
##  $ FatherEd      : int  8 12 12 6 10 16 12 15 16 18 ...
##  $ FamilyIncome78: int  20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
##  $ Science       : int  6 23 14 18 17 16 13 19 22 21 ...
##  $ Arith         : int  8 30 14 13 21 30 17 29 30 17 ...
##  $ Word          : int  15 35 27 35 28 29 30 33 35 28 ...
##  $ Parag         : int  6 15 8 12 10 13 12 13 14 14 ...
##  $ Number        : int  29 45 32 24 40 36 49 35 48 39 ...
##  $ Coding        : int  52 68 35 48 46 30 58 58 61 54 ...
##  $ Auto          : int  9 21 13 11 13 21 11 18 21 18 ...
##  $ Math          : int  6 23 11 4 13 24 17 21 23 20 ...
##  $ Mechanic      : int  10 21 9 12 13 19 11 19 16 20 ...
##  $ Elec          : int  5 19 11 12 15 16 10 16 17 13 ...
##  $ AFQT          : num  6.84 99.39 47.41 44.02 59.68 ...
##  $ Esteem81_1    : int  1 2 2 1 1 1 2 2 2 1 ...
##  $ Esteem81_2    : int  1 1 1 1 1 1 2 2 2 1 ...
##  $ Esteem81_3    : int  4 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_4    : int  1 2 2 2 1 1 2 2 2 1 ...
##  $ Esteem81_5    : int  3 4 3 3 1 4 3 3 3 3 ...
##  $ Esteem81_6    : int  3 2 2 2 1 1 2 2 2 2 ...
##  $ Esteem81_7    : int  1 2 2 3 1 1 3 2 2 1 ...
##  $ Esteem81_8    : int  3 4 2 3 4 4 3 3 3 3 ...
##  $ Esteem81_9    : int  3 3 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_10   : int  3 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem87_1    : int  2 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_2    : int  1 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_3    : int  4 4 4 3 4 4 4 3 4 4 ...
##  $ Esteem87_4    : int  1 1 2 1 1 1 2 2 1 4 ...
##  $ Esteem87_5    : int  2 4 4 4 4 4 4 3 4 4 ...
##  $ Esteem87_6    : int  2 1 2 2 1 1 2 2 1 1 ...
##  $ Esteem87_7    : int  2 2 2 1 1 2 2 2 2 1 ...
##  $ Esteem87_8    : int  3 3 4 2 4 4 4 3 4 3 ...
##  $ Esteem87_9    : int  3 2 3 2 4 4 3 3 3 4 ...
##  $ Esteem87_10   : int  4 4 4 2 4 4 4 3 4 4 ...
attach(NLSY79)
## The following object is masked from package:ISLR:
## 
##     Auto
## see if there are missing values
sum(is.na(NLSY79))
## [1] 0
## Comment: some missing values do exist, but they are not coded as NA; they are blank strings. These occur only in the Job05 variable, and a blank there does not necessarily mean missing: respondents may have felt no category fit their job.

As seen, there are no NA values in the data set, though Job05 contains some blank entries.

1.2 Self esteem evaluation

Let us concentrate on the esteem scores evaluated in 1987.

  1. Reverse Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (Hint: if we store the esteem data in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to reverse the score.)

Student answer: we reverse the scores of Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (We do not change the original data set; we only modify the stored copy of the esteem data.)

esteem_data<-NLSY79[,c(37:46)]
esteem_data[,c(1, 2, 4, 6, 7)]  <- 5 - esteem_data[,c(1, 2, 4, 6, 7)]
colnames(esteem_data)<-c("Esteem87_1","Esteem87_2","Esteem87_3","Esteem87_4","Esteem87_5","Esteem87_6","Esteem87_7","Esteem87_8","Esteem87_9","Esteem87_10")
  1. Write a brief summary with necessary plots about the 10 esteem measurements.

Student Answer: summary of the 10 esteem measurements.

Since we want to show the distribution of each of the 10 esteem measurements, a single two-dimensional plot may not suffice (it cannot display each score's distribution if the x-axis is the measurement index).

So we use histograms, one per measurement:

library(ggplot2)
# one histogram per esteem item; build the ten plots in a loop instead of
# repeating the same ggplot call ten times
plots <- lapply(names(esteem_data), function(v) {
  ggplot(esteem_data) +
    geom_histogram(aes(x = .data[[v]]), bins = 10, fill = "blue") +
    labs(title = v, x = "Score", y = "Frequency")
})

gridExtra::grid.arrange(grobs = plots, ncol = 2) ## arrange the ten plots in two columns

  1. Are the esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.

Student Answer:

From the pairwise correlation matrix below, we observe that all entries are positive, with correlations ranging from 0.24 to 0.70. This is natural because each item of the 1987 esteem test measures the same construct, with higher scores meaning higher esteem (after reverse coding).

corr_pairwise<-cor(esteem_data,method="pearson")
round(corr_pairwise,2)
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1        1.00       0.70       0.45       0.53       0.40       0.46
## Esteem87_2        0.70       1.00       0.44       0.55       0.40       0.48
## Esteem87_3        0.45       0.44       1.00       0.41       0.55       0.41
## Esteem87_4        0.53       0.55       0.41       1.00       0.38       0.51
## Esteem87_5        0.40       0.40       0.55       0.38       1.00       0.40
## Esteem87_6        0.46       0.48       0.41       0.51       0.40       1.00
## Esteem87_7        0.38       0.41       0.34       0.42       0.37       0.60
## Esteem87_8        0.27       0.28       0.35       0.30       0.38       0.41
## Esteem87_9        0.24       0.26       0.35       0.29       0.35       0.36
## Esteem87_10       0.31       0.33       0.46       0.37       0.44       0.44
##             Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1        0.38       0.27       0.24        0.31
## Esteem87_2        0.41       0.28       0.26        0.33
## Esteem87_3        0.34       0.35       0.35        0.46
## Esteem87_4        0.42       0.30       0.29        0.37
## Esteem87_5        0.37       0.38       0.35        0.44
## Esteem87_6        0.60       0.41       0.36        0.44
## Esteem87_7        1.00       0.39       0.35        0.39
## Esteem87_8        0.39       1.00       0.43        0.44
## Esteem87_9        0.35       0.43       1.00        0.58
## Esteem87_10       0.39       0.44       0.58        1.00
  1. PCA on 10 esteem measurements. (centered but no scaling)

    1. Report the PC1 and PC2 loadings. Are they unit vectors? Are they uncorrelated?

Student Answer: we perform PCA on the 10 esteem measurement variables.

pc_esteem.10 <- prcomp(esteem_data, scale = FALSE) ## by default, center = TRUE but scale = FALSE
names(pc_esteem.10)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
## pc_esteem.10$x holds the PC scores of PC1, PC2, ..., PC10

## ten variables here, so each loading vector has 10 entries
pc_esteem.10$rotation[, c(1, 2)]
##               PC1    PC2
## Esteem87_1  0.235 -0.374
## Esteem87_2  0.244 -0.367
## Esteem87_3  0.279 -0.149
## Esteem87_4  0.261 -0.321
## Esteem87_5  0.312 -0.131
## Esteem87_6  0.313 -0.209
## Esteem87_7  0.299 -0.163
## Esteem87_8  0.393  0.332
## Esteem87_9  0.398  0.578
## Esteem87_10 0.376  0.260

The PC1 and PC2 loadings are displayed above: two vectors of ten entries each, giving the directions of the first two PCs. Each loading vector is a unit vector (its squared entries sum to 1), and the two loading vectors are orthogonal, hence uncorrelated.
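As a quick numerical check (a sketch using the `pc_esteem.10` object above), we can verify the unit length and orthogonality directly:

```r
# extract the first two loading vectors
v1 <- pc_esteem.10$rotation[, 1]
v2 <- pc_esteem.10$rotation[, 2]
sum(v1^2)    # squared length of PC1 loading: should be 1
sum(v2^2)    # squared length of PC2 loading: should be 1
sum(v1 * v2) # dot product: should be numerically 0
```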

b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)

Student Answer: Comment: if all loadings are negative, flipping their signs gives an equivalent PC, so we may take the positive loadings for ease of interpretation.

PC1: it is the direction that minimizes the total squared perpendicular distance between the data points and the PC1 line (a line through the origin of the centered data); equivalently, it is the linear combination of all 10 esteem measurements with the largest variance.

It is approximately 0.235 Esteem87_1 + 0.244 Esteem87_2 + 0.279 Esteem87_3 + 0.261 Esteem87_4 + 0.312 Esteem87_5 + 0.313 Esteem87_6 + 0.299 Esteem87_7 + 0.393 Esteem87_8 + 0.398 Esteem87_9 + 0.376 Esteem87_10

PC1 is roughly proportional to a weighted total of the 10 (centered) test scores. Higher PC1 scores suggest a higher overall esteem level.

PC2: approximately proportional to the difference between the last three esteem scores (8, 9, 10) and the first seven. If total scores are comparable, higher PC2 implies a stronger esteem response on the last three items, while lower PC2 implies a stronger response on the first seven items.

c) How is the PC1 score obtained for each subject? Write down the formula.

Student answer:

pc1_esteem <- pc_esteem.10$x[, 1] # extract the PC1 score for each subject

Formula (applied to the centered scores, i.e. each Esteem87_j minus its sample mean): PC1_Score = 0.235 Esteem87_1 + 0.244 Esteem87_2 + 0.279 Esteem87_3 + 0.261 Esteem87_4 + 0.312 Esteem87_5 + 0.313 Esteem87_6 + 0.299 Esteem87_7 + 0.393 Esteem87_8 + 0.398 Esteem87_9 + 0.376 Esteem87_10
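To confirm the formula, we can (as a sketch) reproduce prcomp's PC1 scores by centering the data and projecting it onto the PC1 loading vector:

```r
# center each esteem variable, then project onto the PC1 loading
centered <- scale(esteem_data, center = TRUE, scale = FALSE)
pc1_manual <- centered %*% pc_esteem.10$rotation[, 1]
all.equal(as.numeric(pc1_manual), as.numeric(pc_esteem.10$x[, 1])) # should return TRUE
```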

d) Are PC1 scores and PC2 scores in the data uncorrelated? 

Student Answer:

PC1 and PC2 scores should be uncorrelated, as illustrated below: the sample correlation is about 1.1e-14, i.e. numerically zero.

cor(pc_esteem.10$x[,c(1)],pc_esteem.10$x[,c(2)])
## [1] 1.12e-14
e) Plot PVE (Proportion of Variance Explained) and summarize the plot. 

Student Answer:

plot(summary(pc_esteem.10)$importance[2, ], ylab="PVE",
     xlab="Number of PC's",
     main="Scree Plot of PVE for NLSY79")

Summary of analysis: there is a sharp drop in PVE as the number of PCs increases from 1 to 2. This indicates that the two leading PCs should be sufficient, since from PC3 onward each additional PC explains less than 10% of the variance.

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?

Student Answer: This is another criterion we may use to determine the number of PC’s to use.

plot(summary(pc_esteem.10)$importance[3, ], pch=16, ylab="Cumulative PVE",
xlab="Number of PC's",main="Scree Plot of Cumulative PVE for NLSY79")

From the plot of cumulative PVE, we see that about 60% of the variance in the data is explained by the first two principal components. Hence, if we want to be conservative (say, make sure 90% of the variance is covered), then using 6 PCs should suffice.
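The number of PCs needed for a given coverage can also be read off programmatically (a small sketch based on the same summary object):

```r
# cumulative proportion of variance explained, one entry per PC
cpve <- summary(pc_esteem.10)$importance[3, ]
min(which(cpve >= 0.90)) # smallest number of PCs whose cumulative PVE reaches 90%
```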

g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data.  Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)

Student Answer: We visualize the PC scores together with the loadings of the original variables, and can read off the correlation structure:
lim <- c(-.10, .10) 

biplot(pc_esteem.10,
       xlim=lim,
       ylim=lim,
       main="Biplot of the PC's")
abline(v=0, h=0)

In this plot, the x-axis shows PC1 scores and the y-axis PC2 scores, while the top and right axes give the scale of the variable loadings on PC1 and PC2 respectively.

Interpretation of PC1 and PC2:

All loadings on PC1 are positive, so PC1 measures overall esteem level. If total scores are comparable, higher PC2 implies a stronger esteem response on the last three items, while lower PC2 implies a stronger response on the first seven items.

We observe that the last three questions tend to be the respondents' euphemisms for wishing to be more accomplished and hence gain more self-esteem. Thus higher PC2 seems to imply stronger feelings of the need for greater self-esteem.

  1. Apply k-means to cluster subjects on the original esteem scores

    1. Find a reasonable number of clusters using within sum of squared with elbow rules.

    Student answer:

    Elbow rule: plot the explained variation (here, the total within-cluster sum of squares) as a function of the number of clusters, and pick the elbow of the curve as the number of clusters to use.

## minimize total intra-cluster variation
## k-means clustering
set.seed(0)

# function to compute the total within-cluster sum of squares for a given k
# (kmeans is the base R implementation; nstart = 10 restarts guard against local optima)
wss_esteem <- function(data, k) {
  kmeans(data, centers = k, nstart = 10)$tot.withinss
}

# candidate numbers of clusters: k = 2 to 15
k.values_esteem <- 2:15

# total within-cluster sum of squares for each candidate k
wss_values_esteem <- sapply(k.values_esteem, function(k) wss_esteem(esteem_data, k))

## then we plot the total within-clusters sum of squares against number of clusters
## 

plot(k.values_esteem, wss_values_esteem,
       type="b", pch = 19, frame = FALSE,
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares_Esteem")

Furthermore, as demonstrated in class, it may be even easier to determine the optimal number of clusters using the factoextra package.

library('factoextra')
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(esteem_data, kmeans, method = "wss")

Notice how increasing k from 1 to 2 drastically lowers the total within-cluster sum of squares, but as we continue increasing k the marginal improvement diminishes. The ideal k is found at the elbow of such a plot; here k = 3 is appropriate.

b) Can you summarize common features within each cluster?

Student answer: below we run k-means with k = 3 so that we can examine the details within each cluster.

set.seed(130) # want to make sure our data is replicable

esteem.kmeans <- esteem_data %>%
kmeans(centers = 3)
str(esteem.kmeans)
## List of 9
##  $ cluster     : int [1:2431] 1 1 3 1 3 3 3 2 3 3 ...
##  $ centers     : num [1:3, 1:10] 3.87 3.12 3.88 3.84 3.09 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:10] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4" ...
##  $ totss       : num 8775
##  $ withinss    : num [1:3] 2139 1683 1201
##  $ tot.withinss: num 5024
##  $ betweenss   : num 3751
##  $ size        : int [1:3] 784 822 825
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
### Summarise esteem scores by clusters
esteem_data$cluster <- esteem.kmeans$cluster
esteem_data %>% group_by(cluster) %>% 
  summarise(across(everything(), list(mean)))
esteem_data %>% group_by(cluster) %>% 
  summarise(across(everything(), list(median)))
esteem_data %>% group_by(cluster) %>% 
  summarise(across(everything(), list(min)))
esteem_data %>% group_by(cluster) %>% 
  summarise(across(everything(), list(max)))
esteem_data %>% group_by(cluster) %>% 
  summarise(across(everything(), list(quantile)))
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
## # A tibble: 3 x 11
##   cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## *   <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1       1         3.87         3.84         3.71         3.63         3.62
## 2       2         3.12         3.09         3.10         3.05         3.01
## 3       3         3.88         3.88         3.93         3.81         3.96
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## #   Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
## # A tibble: 3 x 11
##   cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## *   <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1       1            4            4            4            4            4
## 2       2            3            3            3            3            3
## 3       3            4            4            4            4            4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## #   Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
## # A tibble: 3 x 11
##   cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## *   <int>        <dbl>        <dbl>        <int>        <dbl>        <int>
## 1       1            3            3            1            2            1
## 2       2            1            1            1            1            1
## 3       3            2            3            1            1            1
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## #   Esteem87_8_1 <int>, Esteem87_9_1 <int>, Esteem87_10_1 <int>
## # A tibble: 3 x 11
##   cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## *   <int>        <dbl>        <dbl>        <int>        <dbl>        <int>
## 1       1            4            4            4            4            4
## 2       2            4            4            4            4            4
## 3       3            4            4            4            4            4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## #   Esteem87_8_1 <int>, Esteem87_9_1 <int>, Esteem87_10_1 <int>
## # A tibble: 15 x 11
## # Groups:   cluster [3]
##    cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
##      <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1       1            3            3            1            2            1
##  2       1            4            4            3            3            3
##  3       1            4            4            4            4            4
##  4       1            4            4            4            4            4
##  5       1            4            4            4            4            4
##  6       2            1            1            1            1            1
##  7       2            3            3            3            3            3
##  8       2            3            3            3            3            3
##  9       2            3            3            3            3            3
## 10       2            4            4            4            4            4
## 11       3            2            3            1            1            1
## 12       3            4            4            4            4            4
## 13       3            4            4            4            4            4
## 14       3            4            4            4            4            4
## 15       3            4            4            4            4            4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## #   Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
NLSY79$cluster <- esteem.kmeans$cluster
NLSY79 %>% group_by(cluster) %>% summarise(across(everything(), list(mean)))
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
NLSY79 %>% group_by(cluster) %>% summarise(mean(Income87),mean(Gender),mean(AFQT),mean(MotherEd),mean(FatherEd))
## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA
## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA

## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA
## # A tibble: 3 x 47
##   cluster Subject_1 Gender_1 Education05_1 Income87_1 Job05_1 Income05_1
## *   <int>     <dbl>    <dbl>         <dbl>      <dbl>   <dbl>      <dbl>
## 1       1     3598.       NA          14.2     12633.      NA     48429.
## 2       2     3527.       NA          13.0     11555.      NA     37906.
## 3       3     3393.       NA          14.6     15964.      NA     61818.
## # … with 40 more variables: Weight05_1 <dbl>, HeightFeet05_1 <dbl>,
## #   HeightInch05_1 <dbl>, Imagazine_1 <dbl>, Inewspaper_1 <dbl>,
## #   Ilibrary_1 <dbl>, MotherEd_1 <dbl>, FatherEd_1 <dbl>,
## #   FamilyIncome78_1 <dbl>, Science_1 <dbl>, Arith_1 <dbl>, Word_1 <dbl>,
## #   Parag_1 <dbl>, Number_1 <dbl>, Coding_1 <dbl>, Auto_1 <dbl>, Math_1 <dbl>,
## #   Mechanic_1 <dbl>, Elec_1 <dbl>, AFQT_1 <dbl>, Esteem81_1_1 <dbl>,
## #   Esteem81_2_1 <dbl>, Esteem81_3_1 <dbl>, Esteem81_4_1 <dbl>,
## #   Esteem81_5_1 <dbl>, Esteem81_6_1 <dbl>, Esteem81_7_1 <dbl>,
## #   Esteem81_8_1 <dbl>, Esteem81_9_1 <dbl>, Esteem81_10_1 <dbl>,
## #   Esteem87_1_1 <dbl>, Esteem87_2_1 <dbl>, Esteem87_3_1 <dbl>,
## #   Esteem87_4_1 <dbl>, Esteem87_5_1 <dbl>, Esteem87_6_1 <dbl>,
## #   Esteem87_7_1 <dbl>, Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>,
## #   Esteem87_10_1 <dbl>
## # A tibble: 3 x 6
##   cluster `mean(Income87)` `mean(Gender)` `mean(AFQT)` `mean(MotherEd)`
## *   <int>            <dbl>          <dbl>        <dbl>            <dbl>
## 1       1           12633.             NA         57.7             11.8
## 2       2           11555.             NA         44.2             11.1
## 3       3           15964.             NA         62.2             12.2
## # … with 1 more variable: `mean(FatherEd)` <dbl>

(Disregard the NAs: Gender and Job05 are character variables, so mean() returns NA for them.)

Discussion:

Comparing clusters 1, 2, and 3 in the output above, we see that cluster 2 respondents tend to have lower AFQT scores, lower mean income in 1987, and less parental education (both parents). This group represents the cluster with the lowest socioeconomic status.

By contrast, we see that cluster 3 respondents tend to have higher AFQT scores, higher mean income, and higher family education (both parents). This group represents the cluster with higher socioeconomic conditions and status.

In addition, the PC1 and PC2 scores tend to concentrate around a common centroid within each cluster. Specifically, cluster 3 tends to have the largest PC1 and PC2 scores, and cluster 2 the smallest. What characteristics PC1 and PC2 represent is elaborated further in parts (c) and (d).

c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.

Student answer:

We can overlay the cluster labels on the PC1 vs PC2 plot.

K-means separates the groups mainly along PC1. Pairing a covariate with the PC pair of the esteem scores can generate some useful grouping insights.

For example,

pair1: Education with PC1, PC2

data.frame(esteem = NLSY79$Education05, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>% 
  ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem))+ ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2343 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From this pairing, we see that people with more education tend to fall into cluster 3, which has relatively high (positive) PC1 scores but low (negative) PC2 scores, while people with less education are more likely to be grouped into cluster 1, which has relatively low (negative) PC1 scores but high (positive) PC2 scores. However, the association with cluster membership is not very strong, since many people with more years of education also lie in cluster 1.

pair2: Gender with PC1, PC2

data.frame(esteem = NLSY79$Gender, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>% 
  ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point()+ ggrepel::geom_text_repel(aes(label = esteem))+ ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2391 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Here we see that women are more likely than men to fall into cluster 3 (lower, negative PC1 scores), while men are more likely to fall into cluster 1 (higher PC1 scores). In addition, women and men do not differ significantly in PC2 scores, but men generally have higher PC1 scores. This might reflect that in 1987, when the survey was carried out, women were far less empowered than today, and so reported lower levels of self-esteem.

pair3: Income with PC1, PC2

data.frame(esteem = NLSY79$Income87, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>% 
  ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point()+ ggrepel::geom_text_repel(aes(label = esteem))+ ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2387 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From this plot, we do not find a significant relationship between income and the clustering of the esteem PC scores, for either PC1 or PC2. This suggests that income may not have a strong impact on the level of self-esteem.

Pair 4: Weight with PC1, PC2

data.frame(esteem = NLSY79$Weight05, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>% 
  ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem))+ ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2380 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Note that these plots all show the same point positions, since the clustering is performed on PC1 and PC2 in every case; what differs across the plots is the variable (education, income, weight) whose values we examine within each group.

Hence we find that respondents with higher weights may be somewhat more likely to fall into cluster 1 or cluster 3, which have relatively lower PC scores. This may suggest that higher weight harms the level of self-esteem. There is no clear relationship between weight and the PC2 scores.
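Impressions read off the scatter plots can be checked numerically by profiling each cluster. A sketch, assuming the NLSY79 data frame and the esteem.kmeans fit from the earlier chunks are in the workspace (dplyr loaded, as in the plotting chunks above):

```r
# Mean of each covariate within each k-means cluster
data.frame(cluster = as.factor(esteem.kmeans$cluster),
           educ    = NLSY79$Education05,
           income  = NLSY79$Income87,
           weight  = NLSY79$Weight05) %>%
  group_by(cluster) %>%
  summarise(across(c(educ, income, weight), ~ mean(.x, na.rm = TRUE)))
```

If education really separates the clusters, its per-cluster means should differ noticeably; similar means across clusters would confirm the weak association seen in the plots.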

  1. We now try to find out which factors are related to self-esteem. PC1 of all the esteem scores is a good single summary of one’s esteem scores, so we take PC1 as our response variable.

    1. Prepare possible factors:
    • Personal information: gender, education, log(income), job type, Body mass index as a measure of health (The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m²)

    • Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine and Ilibrary as factors

    • Use PC1 of SVABS as level of intelligence

Student Answer:

We use PC1 of the SVABS (ASVAB) test scores as a proxy for intelligence; the test scores are only an approximate representation of the underlying intelligence variable.

We prepare the variables as follows:

# job type: it's character
class(NLSY79$Job05)
length(unique(NLSY79$Job05))

NLSY79$Job05[NLSY79$Job05=="4700 TO 4960: Sales and Related Workers"] <-1
NLSY79$Job05[NLSY79$Job05=="10 TO 430: Executive, Administrative and Managerial Occupations"] <-2
NLSY79$Job05[NLSY79$Job05=="7900 TO 8960: Setters, Operators and Tenders"] <-3
NLSY79$Job05[NLSY79$Job05=="5000 TO 5930: Office and Administrative Support Workers"] <-4
NLSY79$Job05[NLSY79$Job05=="4200 TO 4250: Cleaning and Building Service Occupations"] <-5
NLSY79$Job05[NLSY79$Job05=="6200 TO 6940: Construction Trade and Extraction Workers"] <-6
NLSY79$Job05[NLSY79$Job05=="1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians"] <-7
NLSY79$Job05[NLSY79$Job05=="1800 TO 1860: Social Scientists and Related Workers"] <-8
NLSY79$Job05[NLSY79$Job05=="7000 TO 7620: Installation, Maintenance and Repairs Workers"] <-9
NLSY79$Job05[NLSY79$Job05=="3000 TO 3260: Health Diagnosing and Treating Practitioners"] <-10
NLSY79$Job05[NLSY79$Job05=="1000 TO 1240: Mathematical and Computer Scientists"] <-11
NLSY79$Job05[NLSY79$Job05=="3300 TO 3650: Health Care Technical and Support Occupations"] <-12
NLSY79$Job05[NLSY79$Job05=="500 TO 950: Management Related Occupations"] <-13
NLSY79$Job05[NLSY79$Job05=="4500 TO 4650: Personal Care and Service Workers"] <-14
NLSY79$Job05[NLSY79$Job05=="2200 TO 2340: Teachers"] <-15
NLSY79$Job05[NLSY79$Job05=="2800 TO 2960: Media and Communications Workers"] <-16
NLSY79$Job05[NLSY79$Job05=="2400 TO 2550: Education, Training and Library Workers"] <-17
NLSY79$Job05[NLSY79$Job05=="2100 TO 2150: Lawyers, Judges and Legal Support Workers"] <-18
NLSY79$Job05[NLSY79$Job05=="4000 TO 4160: Food Preparation and Serving Related Occupations"]<-19
NLSY79$Job05[NLSY79$Job05=="3700 TO 3950: Protective Service Occupations"]<-20
NLSY79$Job05[NLSY79$Job05=="9000 TO 9750: Transportation and Material Moving Workers"]<-21
NLSY79$Job05[NLSY79$Job05=="2000 TO 2060: Counselors, Sociala and Religious Workers"]<-22
NLSY79$Job05[NLSY79$Job05==""]<-23 # unspecified (missing values)
NLSY79$Job05[NLSY79$Job05=="7700 TO 7750: Production and Operating Workers"]<-24
NLSY79$Job05[NLSY79$Job05=="6000 TO 6130: Farming, Fishing and Forestry Occupations"]<-25
NLSY79$Job05[NLSY79$Job05=="2600 TO 2760: Entertainers and Performers, Sports and Related Workers"]<-26
NLSY79$Job05[NLSY79$Job05=="1900 TO 1960: Life, Physical and Social Science Technicians"]<-27
NLSY79$Job05[NLSY79$Job05=="4300 TO 4430: Entertainment Attendants and Related Workers"]<-28
NLSY79$Job05[NLSY79$Job05=="1600 TO 1760: Physical Scientists"]<-29
NLSY79$Job05[NLSY79$Job05=="7800 TO 7850: Food Preparation Occupations"]<-30
NLSY79$Job05[NLSY79$Job05=="9990: Uncodeable"]<-31

NLSY79$Job05<-as.factor(NLSY79$Job05)
## we may want to treat job as a categorical variable
## although we can convert this character variable into 31 numbered types, the numeric codes assigned above are arbitrary; in the regression models of part 6.(b), we will include the job variable only in the largest model
## [1] "character"
## [1] 31
Imagazine<-as.factor(NLSY79$Imagazine)
Ilibrary<-as.factor(NLSY79$Ilibrary)
## BMI: the body mass divided by the square of the body height, a measure of health
## BMI Formula: divide your weight (in pounds) by your height (in inches) x your height (in inches) and then multiply by 703
## someone with height 5'5'' has HeightFeet05=5, and HeightInch05=5
# 12*NLSY79$HeightFeet05+NLSY79$HeightInch05 is the height of a person in inches
BMI<-703*NLSY79$Weight05/(12*NLSY79$HeightFeet05+NLSY79$HeightInch05)^2
## SVABS test scores include 
SVABS_score<-NLSY79[,c(16:25)]
pc_SVABS.10<-prcomp(SVABS_score,scale=FALSE) ## by default, center=TRUE and scale=FALSE
pc1_SVABS<-pc_SVABS.10$x[,c(1)]
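As an aside, the long manual mapping above can be avoided: factor() assigns integer codes automatically, which is fine here since the codes are arbitrary anyway. A minimal sketch on a toy vector of hypothetical job titles:

```r
# factor() derives one integer code per distinct string automatically
job <- c("Teachers", "", "Sales and Related Workers", "Teachers")
job[job == ""] <- "Unspecified"    # give missing entries their own level
job_f <- factor(job)
data.frame(job, code = as.integer(job_f))
```

On the real data, `NLSY79$Job05 <- factor(NLSY79$Job05)` would produce the same 31-level factor in one line, just with different (but equally arbitrary) level ordering.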
b)   Run a few regression models between PC1 of all the esteem scores and factors listed in a). Find a final best model with your own criterion. 

  - How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met. 
    
  - Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem. 
    

Student Answer:

We present the multiple regression model of pc1_esteem vs. all factors except job (since it has 31 categories):

Then we will evaluate the model choice by considering: (1) whether the regressors are significant; (2) whether the residual standard error is acceptably small and the R-squared reasonably large.

## response variable PC1 as the summary of one's esteem scores
model1_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$FatherEd+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS)
summary(model1_esteem)
## 
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 + 
##     NLSY79$Income87 + NLSY79$MotherEd + NLSY79$FatherEd + NLSY79$Inewspaper + 
##     NLSY79$FamilyIncome78 + Imagazine + Ilibrary + BMI + pc1_SVABS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.522 -0.959 -0.067  1.001  2.937 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.74e+00   2.16e-01   -8.04  1.4e-15 ***
## NLSY79$Gendermale      1.20e-01   5.28e-02    2.27  0.02339 *  
## NLSY79$Education05     7.61e-02   1.23e-02    6.21  6.1e-10 ***
## NLSY79$Income87        1.29e-05   2.33e-06    5.52  3.7e-08 ***
## NLSY79$MotherEd        1.77e-02   1.28e-02    1.38  0.16667    
## NLSY79$FatherEd        4.17e-03   9.52e-03    0.44  0.66159    
## NLSY79$Inewspaper      1.62e-01   7.93e-02    2.05  0.04091 *  
## NLSY79$FamilyIncome78  1.44e-06   2.00e-06    0.72  0.47221    
## Imagazine1             5.19e-02   6.15e-02    0.84  0.39887    
## Ilibrary1              9.16e-02   6.30e-02    1.46  0.14575    
## BMI                   -3.15e-03   3.82e-03   -0.82  0.40978    
## pc1_SVABS              6.32e-03   1.63e-03    3.88  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.23 on 2419 degrees of freedom
## Multiple R-squared:  0.106,  Adjusted R-squared:  0.102 
## F-statistic: 26.1 on 11 and 2419 DF,  p-value: <2e-16

From the results above, we see that the R-squared is very low: only about 0.106, so the model explains roughly 10.6% of the variance in the esteem PC1. The p-value of the overall F-test, however, is minimal, so the predictors are jointly significant; the model picks up real associations even though its explanatory power is modest.

But if we look at the coefficients individually, we notice that father’s education, family income, and BMI are clearly not significant. Mother’s education and possession of a library card are only mildly significant.

And it seems that except for BMI, all these variables are positively associated with the PC1 of the 10 esteem scores.

So we make some modifications: we drop these clearly insignificant variables and consider a restricted model:

## restricted model1 after dropping the obviously insignificant variables
model1_esteem_restricted<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$Inewspaper+Imagazine+Ilibrary+pc1_SVABS)
summary(model1_esteem_restricted)
## 
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 + 
##     NLSY79$Income87 + NLSY79$MotherEd + NLSY79$Inewspaper + Imagazine + 
##     Ilibrary + pc1_SVABS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.520 -0.957 -0.070  1.008  2.919 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.84e+00   1.84e-01   -9.96  < 2e-16 ***
## NLSY79$Gendermale   1.19e-01   5.25e-02    2.27    0.023 *  
## NLSY79$Education05  7.77e-02   1.21e-02    6.42  1.7e-10 ***
## NLSY79$Income87     1.30e-05   2.31e-06    5.65  1.8e-08 ***
## NLSY79$MotherEd     2.20e-02   1.13e-02    1.95    0.051 .  
## NLSY79$Inewspaper   1.69e-01   7.87e-02    2.15    0.032 *  
## Imagazine1          6.20e-02   6.08e-02    1.02    0.308    
## Ilibrary1           9.50e-02   6.26e-02    1.52    0.129    
## pc1_SVABS           6.54e-03   1.62e-03    4.05  5.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.23 on 2422 degrees of freedom
## Multiple R-squared:  0.105,  Adjusted R-squared:  0.102 
## F-statistic: 35.7 on 8 and 2422 DF,  p-value: <2e-16

Comparing the R-squared and p-values, we find that Imagazine and Ilibrary remain insignificant, while all the retained variables — notably the person’s income and education, which intuitively makes sense — stay significant; mother’s education is now borderline (p ≈ 0.051). The restricted model, however, does not substantially improve the overall fit.

Still, with model1 and model1_restricted we can already tell that personal characteristics have a considerably stronger influence on self-esteem than the person’s family upbringing does.
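Dropping variables can also be justified formally with a partial F-test between the nested models. A sketch, assuming model1_esteem and model1_esteem_restricted from the chunks above are in the workspace; a large p-value supports the restricted model:

```r
# Partial F-test: does dropping FatherEd, FamilyIncome78 and BMI
# significantly worsen the fit of the full model?
anova(model1_esteem_restricted, model1_esteem)
```

This compares the residual sums of squares of the two fits and tests whether the dropped coefficients are jointly zero.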

Alternatively, we may consider an interaction between the parents’ education variables. This can potentially be helpful since the two variables jointly reflect the family upbringing of an individual.

parental_educ<-NLSY79$MotherEd*NLSY79$FatherEd
model2_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+parental_educ+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS)
summary(model2_esteem)
## 
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 + 
##     NLSY79$Income87 + parental_educ + NLSY79$Inewspaper + NLSY79$FamilyIncome78 + 
##     Imagazine + Ilibrary + BMI + pc1_SVABS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.530 -0.952 -0.061  1.009  2.937 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.60e+00   2.05e-01   -7.84  6.7e-15 ***
## NLSY79$Gendermale      1.20e-01   5.28e-02    2.28    0.023 *  
## NLSY79$Education05     7.74e-02   1.23e-02    6.27  4.2e-10 ***
## NLSY79$Income87        1.29e-05   2.33e-06    5.54  3.4e-08 ***
## parental_educ          6.06e-04   4.98e-04    1.22    0.224    
## NLSY79$Inewspaper      1.75e-01   7.87e-02    2.22    0.026 *  
## NLSY79$FamilyIncome78  1.58e-06   2.00e-06    0.79    0.430    
## Imagazine1             5.69e-02   6.14e-02    0.93    0.354    
## Ilibrary1              9.61e-02   6.29e-02    1.53    0.127    
## BMI                   -3.29e-03   3.82e-03   -0.86    0.389    
## pc1_SVABS              6.47e-03   1.62e-03    3.98  7.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.23 on 2420 degrees of freedom
## Multiple R-squared:  0.105,  Adjusted R-squared:  0.102 
## F-statistic: 28.5 on 10 and 2420 DF,  p-value: <2e-16

In model2, the combined parental-education term looks somewhat stronger than father’s education did on its own, though it is still not significant (p ≈ 0.22), and it does not improve the significance of the other variables.

Lastly, we try the most complicated model, in which we add the job categories (coded in part 6.(a)) to see the influence of each type of job.

## response variable PC1 as the summary of one's esteem scores
model3_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$FatherEd+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS+NLSY79$Job05)
summary(model3_esteem)
Anova(model3_esteem)
## 
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 + 
##     NLSY79$Income87 + NLSY79$MotherEd + NLSY79$FatherEd + NLSY79$Inewspaper + 
##     NLSY79$FamilyIncome78 + Imagazine + Ilibrary + BMI + pc1_SVABS + 
##     NLSY79$Job05)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.204 -0.904 -0.042  0.984  3.142 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.41e+00   2.47e-01   -5.73  1.1e-08 ***
## NLSY79$Gendermale      1.63e-01   6.06e-02    2.69   0.0072 ** 
## NLSY79$Education05     6.11e-02   1.36e-02    4.49  7.6e-06 ***
## NLSY79$Income87        1.15e-05   2.34e-06    4.89  1.1e-06 ***
## NLSY79$MotherEd        1.59e-02   1.28e-02    1.24   0.2141    
## NLSY79$FatherEd        3.01e-03   9.49e-03    0.32   0.7512    
## NLSY79$Inewspaper      1.82e-01   7.94e-02    2.29   0.0220 *  
## NLSY79$FamilyIncome78  1.08e-06   1.99e-06    0.54   0.5872    
## Imagazine1             2.19e-02   6.13e-02    0.36   0.7206    
## Ilibrary1              7.79e-02   6.28e-02    1.24   0.2149    
## BMI                   -4.48e-03   3.81e-03   -1.17   0.2402    
## pc1_SVABS              5.22e-03   1.65e-03    3.17   0.0015 ** 
## NLSY79$Job0510         2.10e-01   1.70e-01    1.23   0.2174    
## NLSY79$Job0511         2.13e-01   1.76e-01    1.21   0.2266    
## NLSY79$Job0512        -3.58e-01   1.52e-01   -2.36   0.0183 *  
## NLSY79$Job0513         2.88e-01   1.46e-01    1.97   0.0494 *  
## NLSY79$Job0514         2.63e-02   2.08e-01    0.13   0.8994    
## NLSY79$Job0515        -2.07e-02   1.48e-01   -0.14   0.8887    
## NLSY79$Job0516         5.05e-02   3.50e-01    0.14   0.8851    
## NLSY79$Job0517        -4.30e-02   2.44e-01   -0.18   0.8602    
## NLSY79$Job0518         1.58e-02   3.29e-01    0.05   0.9617    
## NLSY79$Job0519        -3.58e-01   1.73e-01   -2.07   0.0383 *  
## NLSY79$Job052          1.67e-01   1.07e-01    1.56   0.1182    
## NLSY79$Job0520         4.17e-01   1.87e-01    2.23   0.0261 *  
## NLSY79$Job0521        -3.39e-01   1.44e-01   -2.35   0.0190 *  
## NLSY79$Job0522        -1.01e-01   2.12e-01   -0.48   0.6336    
## NLSY79$Job0523        -2.08e-01   1.85e-01   -1.13   0.2593    
## NLSY79$Job0524        -6.54e-02   1.95e-01   -0.34   0.7376    
## NLSY79$Job0525        -4.10e-01   4.19e-01   -0.98   0.3278    
## NLSY79$Job0526         4.86e-01   2.64e-01    1.84   0.0657 .  
## NLSY79$Job0527        -2.52e-02   4.69e-01   -0.05   0.9571    
## NLSY79$Job0528        -9.65e-01   3.96e-01   -2.44   0.0149 *  
## NLSY79$Job0529        -1.20e+00   6.17e-01   -1.95   0.0517 .  
## NLSY79$Job053         -8.27e-02   1.46e-01   -0.57   0.5708    
## NLSY79$Job0530        -1.96e-01   6.16e-01   -0.32   0.7504    
## NLSY79$Job0531        -4.04e-01   1.22e+00   -0.33   0.7409    
## NLSY79$Job054          6.52e-02   1.09e-01    0.60   0.5501    
## NLSY79$Job055         -4.02e-01   1.75e-01   -2.30   0.0216 *  
## NLSY79$Job056         -2.21e-01   1.40e-01   -1.58   0.1132    
## NLSY79$Job057         -2.76e-01   1.89e-01   -1.46   0.1438    
## NLSY79$Job058         -5.20e-01   5.07e-01   -1.03   0.3052    
## NLSY79$Job059         -1.07e-01   1.48e-01   -0.72   0.4693    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.22 on 2389 degrees of freedom
## Multiple R-squared:  0.133,  Adjusted R-squared:  0.118 
## F-statistic:  8.9 on 41 and 2389 DF,  p-value: <2e-16
## 
## Anova Table (Type II tests)
## 
## Response: pc1_esteem
##                       Sum Sq   Df F value  Pr(>F)    
## NLSY79$Gender             11    1    7.23  0.0072 ** 
## NLSY79$Education05        30    1   20.12 7.6e-06 ***
## NLSY79$Income87           36    1   23.95 1.1e-06 ***
## NLSY79$MotherEd            2    1    1.54  0.2141    
## NLSY79$FatherEd            0    1    0.10  0.7512    
## NLSY79$Inewspaper          8    1    5.25  0.0220 *  
## NLSY79$FamilyIncome78      0    1    0.29  0.5872    
## Imagazine                  0    1    0.13  0.7206    
## Ilibrary                   2    1    1.54  0.2149    
## BMI                        2    1    1.38  0.2402    
## pc1_SVABS                 15    1   10.06  0.0015 ** 
## NLSY79$Job05             109   30    2.44 2.2e-05 ***
## Residuals               3548 2389                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that this model has one advantage: it lets us examine the impact of different careers on workers’ self-esteem. The Type II ANOVA shows that the job factor as a whole is highly significant, even though most of the 31 individual job dummies are not significant on their own. Thus we can say job category has a significant influence overall, but we would need more information to pin down the effect of each specific category.

In addition, while this model has a slightly lower residual standard error and a higher multiple R-squared, the number of insignificant coefficients increases substantially. For parsimony, it is therefore not preferred.

In summary, across model1, model1_restricted, model2, and model3, we prefer model1_restricted as described above.
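This preference can be made concrete with information criteria. A sketch, assuming the four fitted models above are in the workspace; lower values are better, and BIC penalizes extra parameters more heavily, which tends to favor the restricted model:

```r
# Compare the fit/complexity trade-off of all four candidate models
AIC(model1_esteem, model1_esteem_restricted, model2_esteem, model3_esteem)
BIC(model1_esteem, model1_esteem_restricted, model2_esteem, model3_esteem)
```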

Overall, we find that gender (being male), years of education, income, parental education, newspaper readership, magazine and library access, and personal intelligence are all positively associated with self-esteem, though several of these associations are not statistically significant.

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to public from the Genomic Data Commons Data Portal (GDC).

In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like breast cancers, or triple-negative breast cancers, do not have the three receptors that the other sub-types have, and so have fewer treatment options.

We will try to use mRNA expression data alone without the labels to classify 4 sub-types. Classification without labels or prediction without outcomes is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-type can be separated through mRNA data.

We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().

  1. Summary and transformation
  1. How many patients do we have in each sub-type?

Student Answer: 208 patients for Basal sub-type, 91 for Her2, 628 for LumA, and 233 for LumB.

brca_subtype %>%
  group_by(BRCA_Subtype_PAM50) %>%
  summarise(n())
## # A tibble: 4 x 2
##   BRCA_Subtype_PAM50 `n()`
## * <chr>              <int>
## 1 Basal                208
## 2 Her2                  91
## 3 LumA                 628
## 4 LumB                 233
  1. Randomly pick 5 genes and plot their histograms by sub-type.
set.seed(2048)
n_gene <- 5

sample_idx <- sample(ncol(brca), n_gene)
plot_data  <- select(brca, all_of(sample_idx))
plot_data$TYPE <- brca_subtype$BRCA_Subtype_PAM50

plot_data %>%
  pivot_longer(cols = 1:n_gene)  %>%
  group_by(TYPE, name) %>%
  ggplot(aes(x = value)) +
  geom_histogram(aes(y=..density.., fill = name)) +
  facet_grid(name ~ TYPE, scales = "free") +
  labs(x = 'Value', y = 'Density') +
  theme_bw() +
  theme(legend.position = "none")

  1. Remove genes with zero counts or no variability, then apply a logarithmic transform.
sel_cols <- which((colSums(abs(brca)) != 0) & (sapply(brca, var) > 1e-10))
brca_sub <- brca[, sel_cols, with=F]
brca_sub <- log2(brca_sub + 1e-10)
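The filtering logic can be sanity-checked on a toy matrix (hypothetical values): a gene of all zeros fails the colSums test, and a constant nonzero gene passes it but fails the variance test.

```r
expr <- data.frame(g1 = c(1, 2, 4, 8),   # varying: kept
                   g2 = c(0, 0, 0, 0),   # all zero: dropped by the colSums test
                   g3 = c(6, 6, 6, 6))   # constant: dropped by the variance test
keep <- (colSums(abs(expr)) != 0) & (sapply(expr, var) > 1e-10)
names(expr)[keep]                        # only "g1" survives
log2(expr[, keep, drop = FALSE] + 1e-10)
```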
  1. Apply kmeans on the transformed dataset with 4 centers and output the discrepancy table between the real sub-type brca_subtype and the cluster labels.
brca_sub_kmeans <- kmeans(x = brca_sub, centers = 4, nstart = 50)
table(brca_subtype$BRCA_Subtype_PAM50, brca_sub_kmeans$cluster)
##        
##           1   2   3   4
##   Basal  17   1 187   3
##   Her2    9  43  16  23
##   LumA   86 395   0 147
##   LumB   22 104   2 105
  1. Spectrum clustering: to scale or not to scale?
  1. Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba().

Student Answer: The top 4 PCs explain much more variance than the rest, with a clear elbow point at 4 PCs. Thus 4 PCs is a reasonable choice, although we might also want to take into account the performance of the downstream clustering.

pca_ct_sc <- prcomp(brca_sub, center = T, scale. = T)
pca_ct <- prcomp(brca_sub, center = T, scale. = F)
pve <- summary(pca_ct_sc)$importance[2, 1:10]

data.frame("index"=1:10, "pve"=pve) %>%
  ggplot(aes(x = index, y = pve)) +
  geom_point() + geom_line() +
  labs(title = "PVE Centered and Scaled", x = '# of PCs', y = 'Proportion of Variance Explained') +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()
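The problem statement encourages irlba::irlba() for large matrices: prcomp() computes every principal component, whereas a truncated SVD computes only the leading ones, which is much faster on a wide gene-expression matrix. A sketch, assuming brca_sub from the chunk above and the irlba package installed:

```r
library(irlba)
# Truncated PCA: compute only the first 10 PCs
pc_top <- prcomp_irlba(as.matrix(brca_sub), n = 10, center = TRUE, scale. = TRUE)
# With scaling, every gene has variance 1, so the total variance equals ncol(brca_sub)
pve_top <- pc_top$sdev^2 / ncol(brca_sub)
```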

  1. Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering purposes? Why? (Hint: to put plots side by side, use gridExtra::grid.arrange() or ggpubr::ggarrange() or egg::ggarrange() for ggplots; use fig.show="hold" as the chunk option for base plots)

Student Answer: We should NOT scale for clustering purposes: as the two plots below show, the unscaled PCs produce much better separation between the cancer types (especially between the Basal and Her2 types).

plot_1 <-
  data.table(x = pca_ct_sc$x[, 1],
           y = pca_ct_sc$x[, 2],
           label = as.factor(brca_subtype$BRCA_Subtype_PAM50)) %>%
  ggplot() +
  geom_point(aes(x = x, y = y, col = label)) +
  labs(title = "Centered and Scaled", x = 'PC 1', y = 'PC 2')

plot_2 <-
  data.table(x = pca_ct$x[, 1],
           y = pca_ct$x[, 2],
           label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
           cluster = as.factor(brca_sub_kmeans$cluster)) %>%
  ggplot() +
  geom_point(aes(x = x, y = y, col = label)) +
  labs(title = "Centered and Unscaled", x = 'PC 1', y = 'PC 2')

grid.arrange(plot_1, plot_2, ncol = 2)

  1. Spectrum clustering: center but do not scale the data
  1. Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.

Student Answer: K = 4 is a reasonable number of clusters according to the elbow rule (although in our case the location of the “elbow” is somewhat ambiguous).

wss <- function(df, k)
{
  kmeans(df, k, nstart = 10)$tot.withinss
}

num_k <- 2:15
wss_values <- map_dbl(num_k, function(k) wss(pca_ct$x[, 1:4], k))

data.frame("k" = num_k, "wss" = wss_values) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_point() + geom_line() +
  labs(x = '# of Clusters K', y = 'WSS')

  1. Choose an optimal cluster number and apply kmeans. Compare the real sub-types and the cluster labels as follows: plot a scatter plot of PC1 vs PC2, using point color to indicate the true cancer type and point shape to indicate the cluster label, and plot the kmeans centroids as black dots. Summarize how well the clustering results match the real sub-types.

Student Answer: The clustering successfully distinguishes the Basal subtype from LumA/LumB, and a large portion of the LumA subtype is also isolated from the rest. However, the Her2 subtype cannot be well separated, and there is considerable mislabeling between LumA and LumB.

n_cluster <- 4
n_PCs <- 4

pca_ct_kmeans <- kmeans(x = pca_ct$x[, 1:n_PCs], centers = n_cluster, nstart = 100)
centroids <- data.frame(pca_ct_kmeans$centers[, 1:2])
centroids$cluster = as.factor(1:n_cluster)

plot_data <- data.table(x = pca_ct$x[, 1],
                        y = pca_ct$x[, 2],
                        label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
                        cluster = as.factor(pca_ct_kmeans$cluster))

ggplot() +
geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
labs(title = "Centered and Unscaled", x = 'PC 1', y = 'PC 2')

  1. Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?

Student Answer: In this case, PCA did NOT noticeably improve the quality of the k-means clusters; the main gain is that k-means runs much faster on only 4 PCs. In general, PCA can help clustering because it keeps the directions that carry the most variance, which can reduce the noise in clustering.

plot_1 <-
  ggplot() +
  geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
  geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
  labs(title = "kmeans with 4 PCs", x = 'PC 1', y = 'PC 2')

plot_2 <-
  data.table(x = pca_ct$x[, 1],
             y = pca_ct$x[, 2],
             label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
             cluster = as.factor(brca_sub_kmeans$cluster)) %>%
  ggplot() +
  geom_point(aes(x = x, y = y, col = label, shape = cluster)) +
  labs(title = "kmeans with Original Data", x = 'PC 1', y = 'PC 2')

grid.arrange(plot_1, plot_2, ncol = 2)

print("Kmeans with Original Data")
## [1] "Kmeans with Original Data"
table(brca_subtype$BRCA_Subtype_PAM50, brca_sub_kmeans$cluster)
##        
##           1   2   3   4
##   Basal  17   1 187   3
##   Her2    9  43  16  23
##   LumA   86 395   0 147
##   LumB   22 104   2 105
print("Kmeans with 4PCs")
## [1] "Kmeans with 4PCs"
table(brca_subtype$BRCA_Subtype_PAM50, pca_ct_kmeans$cluster)
##        
##           1   2   3   4
##   Basal 187   1  17   3
##   Her2   28  27   9  27
##   LumA    0 376  91 161
##   LumB    2  99  22 110
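Beyond eyeballing the two tables, agreement with the true labels can be quantified with the adjusted Rand index (1 means perfect agreement with the sub-types, values near 0 mean chance-level agreement). A sketch, assuming the mclust package is available and the objects above are in the workspace:

```r
library(mclust)
# Agreement of each clustering with the PAM50 sub-type labels
adjustedRandIndex(brca_subtype$BRCA_Subtype_PAM50, brca_sub_kmeans$cluster)
adjustedRandIndex(brca_subtype$BRCA_Subtype_PAM50, pca_ct_kmeans$cluster)
```

Similar index values for the two clusterings would support the conclusion that PCA mainly buys speed rather than cluster quality here.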
  1. Now we have a patient x with breast cancer of unknown sub-type, and we have this patient’s mRNA sequencing data. Project patient x onto the space of PC1 and PC2. (Hint: remember we removed some genes with no counts or no variability, took logs, and centered.) Plot this patient in the plot from iv) with a black dot. Calculate the Euclidean distance between this patient and each cluster centroid. Can you tell which sub-type this patient might have?

Student Answer: This patient is most likely to have the Basal subtype. The Euclidean distances to the centroids of clusters 1 through 4 are 122, 496, 508, and 469, respectively; the patient is by far closest to cluster 1, which is dominated by Basal cases.

x_patient <- fread("data/brca_x_patient.csv")
x_patient <- log2(x_patient[, sel_cols, with=F] + 1e-10) - pca_ct$center

patient = data.frame("PC1" = as.matrix(x_patient[1, ]) %*% as.matrix(pca_ct$rotation[, 1]),
                     "PC2" = as.matrix(x_patient[1, ]) %*% as.matrix(pca_ct$rotation[, 2]))

ggplot() +
geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
geom_point(data = patient, aes(x = PC1, y = PC2), shape = 21, size = 4, fill = 'red') +
ggrepel::geom_label_repel(data = patient, aes(x = PC1, y = PC2), label = "Patient", size = 4) +
labs(title = "Centered and Unscaled, with New Patient", x = 'PC 1', y = 'PC 2')

for (idx in 1:4)
{
  # norm() defaults to the one norm; type = "F" gives the Euclidean distance
  print(norm(as.matrix(centroids[idx, 1:2]) - as.matrix(patient[1, ]), type = "F"))
}
## [1] 122
## [1] 496
## [1] 508
## [1] 469
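The manual centering and matrix multiplication can also be delegated to predict(), which applies the centering stored in the prcomp object automatically. A sketch, where x_patient_log is a hypothetical name for the patient’s row after gene filtering and the log transform but before subtracting pca_ct$center (the chunk above folds these steps together):

```r
# predict() subtracts pca_ct$center itself, then multiplies by the rotation
x_patient_log <- log2(fread("data/brca_x_patient.csv")[, sel_cols, with = FALSE] + 1e-10)
scores <- predict(pca_ct, newdata = as.data.frame(x_patient_log))
scores[1, 1:2]   # PC1 and PC2 coordinates of the new patient
```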

3 Case study 3: Auto data set

This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars; it is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR; the Auto dataset should then be loaded automatically. We’ll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.

library(ISLR)
?ISLR::Auto
auto_data <- ISLR::Auto
auto_data$origin <- as.factor(auto_data$origin)

3.1 EDA

Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.

Student Answer:

  • Out of the nine columns, only name and origin should contain factor-type data; the other seven columns all contain numeric data.

  • The distribution of the response variable, mpg, is right-skewed. Five of the seven numeric variables’ distributions are right-skewed.

  • There are 301 names for 392 vehicles – almost every vehicle has a unique name.

  • While the range of number of cylinders for this dataset is 3 to 8, there is no vehicle with 7 cylinders.

  • cylinders, displacement, horsepower, and weight are highly positively correlated with each other (all pairwise correlations among the four variables exceed 0.7).

str(auto_data)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(auto_data)
##       mpg         cylinders     displacement   horsepower        weight    
##  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
##  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
##                                                                            
##   acceleration       year    origin                  name    
##  Min.   : 8.0   Min.   :70   1:245   amc matador       :  5  
##  1st Qu.:13.8   1st Qu.:73   2: 68   ford pinto        :  5  
##  Median :15.5   Median :76   3: 79   toyota corolla    :  5  
##  Mean   :15.5   Mean   :76           amc gremlin       :  4  
##  3rd Qu.:17.0   3rd Qu.:79           amc hornet        :  4  
##  Max.   :24.8   Max.   :82           chevrolet chevette:  4  
##                                      (Other)           :365
length(unique(auto_data$name))
unique(auto_data$cylinders)
## [1] 301
## [1] 8 4 6 3 5
library(skimr)
skim(auto_data)
Data summary
Name auto_data
Number of rows 392
Number of columns 9
_______________________
Column type frequency:
factor 2
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
origin 0 1 FALSE 3 1: 245, 3: 79, 2: 68
name 0 1 FALSE 301 amc: 5, for: 5, toy: 5, amc: 4

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 23.45 7.81 9 17.0 22.8 29 46.6 ▆▇▆▃▁
cylinders 0 1 5.47 1.71 3 4.0 4.0 8 8.0 ▇▁▃▁▅
displacement 0 1 194.41 104.64 68 105.0 151.0 276 455.0 ▇▂▂▃▁
horsepower 0 1 104.47 38.49 46 75.0 93.5 126 230.0 ▆▇▃▁▁
weight 0 1 2977.58 849.40 1613 2225.2 2803.5 3615 5140.0 ▇▇▅▅▂
acceleration 0 1 15.54 2.76 8 13.8 15.5 17 24.8 ▁▆▇▂▁
year 0 1 75.98 3.68 70 73.0 76.0 79 82.0 ▇▆▇▆▇
hist(auto_data$mpg)

auto_data %>% select_if(is.numeric) %>% cor()
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
##                year
## mpg           0.581
## cylinders    -0.346
## displacement -0.370
## horsepower   -0.416
## weight       -0.309
## acceleration  0.290
## year          1.000
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following object is masked from 'package:tidyr':
## 
##     smiths
corr.table <- auto_data %>% select_if(is.numeric) %>% cor() %>% reshape2::melt() 

corr.table %>% ggplot(aes(x=Var1, y=Var2, fill=value)) + geom_tile() +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "Correlation"))

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
auto_data %>% select_if(is.numeric) %>% ggpairs()

3.2 What effect does time have on MPG?

  1. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.

Student Answer:

After running a regression of mpg vs. year, the summary result is below. Yes, year is a significant variable at the .05 level.

Interpretation: according to this model, each one-unit increase in model year (one year, coded modulo 100) is associated with an average increase of 1.23 in the vehicle’s miles per gallon.

lm_auto1 <- lm(mpg ~ year, data = auto_data)
summary(lm_auto1)
## 
## Call:
## lm(formula = mpg ~ year, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.021  -5.441  -0.441   4.974  18.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -70.0117     6.6452   -10.5   <2e-16 ***
## year          1.2300     0.0874    14.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.335 
## F-statistic:  198 on 1 and 390 DF,  p-value: <2e-16
  2. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year effect found here.

Student Answer: Yes, year is still a significant variable at the .05 level.

Interpretation: holding the horsepower of the vehicle constant, each additional model year (modulo 100) is associated with an average increase of 0.65 in the vehicle’s miles per gallon.

lm_auto2 <- lm(mpg ~ year+horsepower, data = auto_data)
summary(lm_auto2)
## 
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.077  -3.078  -0.431   2.588  15.315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.73917    5.34903   -2.38    0.018 *  
## year          0.65727    0.06626    9.92   <2e-16 ***
## horsepower   -0.13165    0.00634  -20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared:  0.685,  Adjusted R-squared:  0.684 
## F-statistic:  424 on 2 and 389 DF,  p-value: <2e-16
  3. The two 95% CIs for the coefficient of year differ between (i) and (ii). How would you explain the difference to a non-statistician?

Student Answer:

The difference lies in whether the horsepower variable is held constant when estimating the 95% CI for the coefficient of year.

The 95% CI in (i) estimates the coefficient of year without accounting for any other variable: ignoring everything else about the vehicles, the 95% CI for the coefficient of year is (1.06, 1.40). (Note: this year effect is not purely from year itself; it also absorbs other vehicle characteristics that changed over time.)

On the other hand, the 95% CI in (ii) estimates the coefficient of year while holding the horsepower variable constant. For vehicles with the same horsepower, the 95% CI for the coefficient of year is (0.527, 0.788).
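A confidence interval like the ones below can also be computed by hand as estimate ± t* × standard error, which is exactly what confint() does. A minimal sketch of this, using the built-in mtcars data (rather than the Auto data) so it is self-contained:

```r
# A 95% CI is estimate +/- t* x SE, where t* is the 0.975 quantile of the
# t distribution with the model's residual degrees of freedom.
fit <- lm(mpg ~ wt, data = mtcars)
est    <- coef(summary(fit))["wt", "Estimate"]
se     <- coef(summary(fit))["wt", "Std. Error"]
t_star <- qt(0.975, df = fit$df.residual)
c(lower = est - t_star * se, upper = est + t_star * se)
# matches confint(fit)["wt", ]
```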

confint(lm_auto1,level = 0.95)
##              2.5 % 97.5 %
## (Intercept) -83.08  -56.9
## year          1.06    1.4
confint(lm_auto2,level = 0.95)
##               2.5 % 97.5 %
## (Intercept) -23.256 -2.223
## year          0.527  0.788
## horsepower   -0.144 -0.119
  4. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at the .05 level? Explain the year effect (if any).

Student Answer: With a p-value < 2e-16 for the interaction term, we can argue that the interaction effect is significant at the .05 level.

Year effect: because of the interaction, the year effect depends on horsepower. The expected average increase in mpg per additional model year is 2.19 - 0.016 * horsepower; the coefficient 2.19 by itself is the year effect only when horsepower = 0, which lies outside the range of the data.
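In an interaction model the slope of one predictor can be evaluated at any value of the other. A minimal sketch of this idea using the built-in mtcars data (with wt and hp standing in for year and horsepower, so the example runs without the ISLR package):

```r
# In lm(y ~ a * b), the slope of a at a given value of b is b_a + b_ab * b.
fit <- lm(mpg ~ wt * hp, data = mtcars)
b <- coef(fit)
wt_slope <- function(hp) unname(b["wt"] + b["wt:hp"] * hp)
wt_slope(100)  # the wt effect for a 100-hp car
wt_slope(250)  # a different wt effect for a 250-hp car
```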

lm_auto3 <- lm(mpg ~ year * horsepower, data = auto_data)
summary(lm_auto3)
anova(lm_auto2,lm_auto3)
## 
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.349  -2.451  -0.456   2.406  14.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.27e+02   1.21e+01  -10.45   <2e-16 ***
## year             2.19e+00   1.61e-01   13.59   <2e-16 ***
## horsepower       1.05e+00   1.15e-01    9.06   <2e-16 ***
## year:horsepower -1.60e-02   1.56e-03  -10.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  393 on 3 and 388 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Model 1: mpg ~ year + horsepower
## Model 2: mpg ~ year * horsepower
##   Res.Df  RSS Df Sum of Sq   F Pr(>F)    
## 1    389 7491                            
## 2    388 5903  1      1588 104 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  1. Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?

Student Answer: Yes, with a coefficient p-value < 2e-16, cylinders is significant at the 0.01 level.

Effect: each additional cylinder is associated with an average decrease of 3.558 in the vehicle’s miles per gallon.

lm_auto4 <- lm(mpg ~ cylinders, data = auto_data)
summary(lm_auto4)
## 
## Call:
## lm(formula = mpg ~ cylinders, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.241  -3.183  -0.633   2.549  17.917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.916      0.835    51.4   <2e-16 ***
## cylinders     -3.558      0.146   -24.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.604 
## F-statistic:  597 on 1 and 390 DF,  p-value: <2e-16
  2. Fit a model that treats cylinders as a categorical/factor variable. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect on mpg.

Student Answer:

Among the individual level coefficients, only cylinders4 (p-value 0.00027) is significant at the 0.01 level; the coefficients for 5, 6, and 8 cylinders are not. (Each of these t-tests compares a level against the baseline of 3 cylinders; the overall F-test at the bottom of the summary shows that cylinders as a whole is highly significant.) The effect of cylinders differs by level.

Effect: in this model, which uses only cylinders (as a factor), relative to a 3-cylinder vehicle the average miles per gallon is 8.734 higher with 4 cylinders, 6.817 higher with 5 cylinders, 0.577 lower with 6 cylinders, and 5.587 lower with 8 cylinders.

lm_auto5 <- lm(mpg ~ as.factor(cylinders), data = auto_data)
summary(lm_auto5)
## 
## Call:
## lm(formula = mpg ~ as.factor(cylinders), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.284  -2.904  -0.963   2.344  18.027 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             20.550      2.349    8.75  < 2e-16 ***
## as.factor(cylinders)4    8.734      2.373    3.68  0.00027 ***
## as.factor(cylinders)5    6.817      3.589    1.90  0.05825 .  
## as.factor(cylinders)6   -0.577      2.405   -0.24  0.81071    
## as.factor(cylinders)8   -5.587      2.395   -2.33  0.02015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.638 
## F-statistic:  173 on 4 and 387 DF,  p-value: <2e-16
  3. What are the fundamental differences between treating cylinders as a continuous variable and as a categorical variable in your models?

Student Answer:

When treating cylinders as a continuous variable, there is only one cylinders term in the model, which forces a single constant slope (a linear effect) across all cylinder counts. When treating cylinders as a categorical variable, each factor level other than the baseline (cylinders = 3) becomes an indicator (dummy) variable, so each cylinder count is allowed its own mean mpg; the factor coding uses more degrees of freedom but imposes no shape on the effect.
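One way to see the difference is to compare the design matrices R builds in each case. A sketch with the built-in mtcars data, where cyl takes the values 4, 6, and 8:

```r
# Numeric coding: a single cyl column, so one shared slope.
ncol(model.matrix(~ cyl, data = mtcars))          # intercept + cyl
# Factor coding: one indicator column per non-baseline level,
# so each level gets its own mean.
ncol(model.matrix(~ factor(cyl), data = mtcars))  # intercept + cyl6 + cyl8
head(model.matrix(~ factor(cyl), data = mtcars), 3)
```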

  4. Can you test the null hypothesis fit0: mpg is linear in cylinders vs. the alternative fit1: mpg relates to cylinders as a categorical variable, at the .01 level?

Student Answer:

Using anova() to compare the two nested models with an F-test, the p-value is 3.4e-08, so at the 0.01 level we reject the null hypothesis fit0 (mpg is linear in cylinders) in favor of fit1 (mpg relates to cylinders as a categorical variable).
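The F statistic that anova() reports for nested models can be reconstructed from the two residual sums of squares, which makes the test's logic concrete. A minimal sketch with the built-in mtcars data (cyl has three levels there, so the factor model uses one extra degree of freedom):

```r
# F = ((RSS0 - RSS1) / df_diff) / (RSS1 / df_resid1)
fit0 <- lm(mpg ~ cyl, data = mtcars)          # linear in cyl
fit1 <- lm(mpg ~ factor(cyl), data = mtcars)  # cyl as categorical
rss0 <- sum(resid(fit0)^2)
rss1 <- sum(resid(fit1)^2)
df_diff <- fit0$df.residual - fit1$df.residual  # 1 here
Fstat <- ((rss0 - rss1) / df_diff) / (rss1 / fit1$df.residual)
Fstat  # equals the F value in anova(fit0, fit1)
```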

summary(lm_auto4)
summary(lm_auto5)
anova(lm_auto4,lm_auto5)
## 
## Call:
## lm(formula = mpg ~ cylinders, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.241  -3.183  -0.633   2.549  17.917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.916      0.835    51.4   <2e-16 ***
## cylinders     -3.558      0.146   -24.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.604 
## F-statistic:  597 on 1 and 390 DF,  p-value: <2e-16
## 
## 
## Call:
## lm(formula = mpg ~ as.factor(cylinders), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.284  -2.904  -0.963   2.344  18.027 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             20.550      2.349    8.75  < 2e-16 ***
## as.factor(cylinders)4    8.734      2.373    3.68  0.00027 ***
## as.factor(cylinders)5    6.817      3.589    1.90  0.05825 .  
## as.factor(cylinders)6   -0.577      2.405   -0.24  0.81071    
## as.factor(cylinders)8   -5.587      2.395   -2.33  0.02015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.638 
## F-statistic:  173 on 4 and 387 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ as.factor(cylinders)
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    390 9416                              
## 2    387 8544  3       871 13.2 3.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Results

Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.

  1. Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.

  2. Summarize the effects found.

  3. Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.

Student Answer:

Final model: with mpg (numeric) as the response variable, we include cylinders (factor), origin (factor), and year (numeric) as explanatory variables in the linear regression model.

Examining the assumptions for linear regression: first, we did not include displacement, weight, and horsepower, to avoid the multicollinearity suggested by the correlation matrix in the EDA section. We also removed variables with p-value > 0.05 and eventually reached the final model below. Second, the residuals vs. fitted plot suggests that the homoscedasticity assumption is mostly satisfied. Third, the normal QQ plot suggests that the normality assumption is mostly satisfied, with acceptable deviation in the upper quantiles.
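Multicollinearity can also be quantified with variance inflation factors: the VIF of a predictor is 1/(1 − R²) from regressing it on the other predictors, and values well above 5-10 are a warning sign. A minimal base-R sketch with the built-in mtcars data (the car package's vif() automates this):

```r
# VIF for wt given hp and disp: regress wt on the others, invert 1 - R^2.
r2 <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2)
vif_wt  # well above 1, reflecting the strong wt/disp correlation
```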

Cylinders: holding the other variables constant, relative to a 3-cylinder vehicle, we expect the average miles per gallon to be 9.47 higher with 4 cylinders, 6.20 higher with 5, 2.40 higher with 6, and 0.74 lower with 8.

Origin: holding the other variables constant, compared with American-made vehicles, we expect European and Japanese vehicles to average 1.66 and 3.65 more miles per gallon, respectively.

Year: holding the other variables constant, each additional model year is associated with an average increase of 0.74 miles per gallon.

Prediction: predicted mpg = 21.7 miles per gallon; 95% CI for the mean response: (20.5, 23.0) miles per gallon. (Note that year 83 lies just outside the observed range of 70-82, so this prediction is a mild extrapolation.)

library(car)
final_model <- lm(mpg ~ as.factor(cylinders)+as.factor(origin)+year, data = auto_data)
## cylinders+displacement+horsepower+weight+acceleration+year+origin
summary(final_model)
Anova(final_model)
## 
## Call:
## lm(formula = mpg ~ as.factor(cylinders) + as.factor(origin) + 
##     year, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.782  -2.214  -0.416   2.060  13.870 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -39.2808     4.7010   -8.36  1.2e-15 ***
## as.factor(cylinders)4   9.4719     1.9347    4.90  1.4e-06 ***
## as.factor(cylinders)5   6.2021     2.9595    2.10   0.0368 *  
## as.factor(cylinders)6   2.3960     1.9987    1.20   0.2314    
## as.factor(cylinders)8  -0.7457     2.0076   -0.37   0.7105    
## as.factor(origin)2      1.6631     0.6285    2.65   0.0085 ** 
## as.factor(origin)3      3.6529     0.5898    6.19  1.5e-09 ***
## year                    0.7441     0.0566   13.16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.76 on 384 degrees of freedom
## Multiple R-squared:  0.772,  Adjusted R-squared:  0.768 
## F-statistic:  186 on 7 and 384 DF,  p-value: <2e-16
## 
## Anova Table (Type II tests)
## 
## Response: mpg
##                      Sum Sq  Df F value  Pr(>F)    
## as.factor(cylinders)   4590   4    81.1 < 2e-16 ***
## as.factor(origin)       543   2    19.2 1.2e-08 ***
## year                   2449   1   173.1 < 2e-16 ***
## Residuals              5433 384                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,2), mar=c(5,2,4,2), mgp=c(3,0.5,0)) # plot(final_model) produces several plots
plot(final_model, 1, pch=16) # residual plot
abline(h=0, col="blue", lwd=2)
plot(final_model, 2) # qqplot

pred_car = data.frame(displacement = 350, cylinders = as.factor(8), origin = as.factor(1), horsepower = 260, weight = 4000, year = 83)
pred <- predict(final_model,newdata = pred_car,interval = "confidence",se.fit = TRUE)
pred
## $fit
##    fit  lwr upr
## 1 21.7 20.5  23
## 
## $se.fit
## [1] 0.634
## 
## $df
## [1] 384
## 
## $residual.scale
## [1] 3.76
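A note on interval types: interval = "confidence" above gives a CI for the mean mpg of cars with these characteristics, while interval = "prediction" gives a wider interval that covers an individual new car. A minimal sketch of the difference with the built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.5)
conf_int <- predict(fit, new_car, interval = "confidence")  # mean response
pred_int <- predict(fit, new_car, interval = "prediction")  # single new car
conf_int
pred_int  # same point estimate, wider bounds
```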
### Other exploration


# rmse = function(a,b) { sqrt(mean((a-b)^2)) }
# 
# library(leaps)
# fit.exh <- regsubsets(mpg ~as.factor(cylinders)+displacement+horsepower+weight+acceleration+year+as.factor(origin), auto_data , nvmax=25, method="exhaustive")
# 
# f.e <- summary(fit.exh)
# plot(f.e$cp, xlab="Number of predictors", ylab="Cp", col="red", pch=16)
# 
# opt.size <- which.min(f.e$cp) 
# opt.size
# fit.exh.var <- f.e$which
# colnames(fit.exh.var)[fit.exh.var[opt.size,]]

4 Simple Regression through simulations

4.1 Linear model through simulations

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.

Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).

We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:

# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)

4.1.1 Generate data

Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.

set.seed(1)
y <- 1 + 1.2 * x + rnorm(length(x), mean = 0, sd = 2.0)

data.frame("x" = x, "y" = y) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(aes(color = "Data")) + 
  labs(title = "scatter plot of x - y", x = "x", y = "y")

4.1.2 Understand the model

  1. Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\), using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look good?

Student Answer: \(\hat{\boldsymbol{\beta}}_0 = 1.331, \hat{\boldsymbol{\beta}}_1 = 0.906\). The true values are \(\boldsymbol{\beta}_0 = 1.0, \boldsymbol{\beta}_1 = 1.2\). The estimates are not particularly good, because the noise level (\(\sigma = 2\)) is large relative to the signal.

fit <- lm(y ~ x, data.frame("x" = x, "y" = y))
fit$coefficients
## (Intercept)           x 
##       1.331       0.906
  2. What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?

Student Answer: \(RSE = 1.79\). It is relatively close to \(\sigma = 2\).

summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = data.frame(x = x, y = y))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.662 -0.880  0.014  1.247  2.882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.331      0.557    2.39    0.022 *
## x              0.906      0.959    0.95    0.350  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared:  0.023,  Adjusted R-squared:  -0.00272 
## F-statistic: 0.894 on 1 and 38 DF,  p-value: 0.35
  3. What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?

Student Answer: The \(95\%\) interval for \(\boldsymbol{\beta}_1\) is \([-1.034, 2.85]\). The interval captures the true \(\boldsymbol{\beta}_1 = 1.2\).

confint(fit)
##              2.5 % 97.5 %
## (Intercept)  0.203   2.46
## x           -1.034   2.85
  4. Overlay the LS-estimated line and the true mean line onto a copy of the scatterplot you made above.
y_true <- 1 + 1.2 * x

data.frame("x" = x, "y" = y) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(aes(color = "Data")) + 
  geom_smooth(aes(color = "LS Estimate"), method = "lm", formula = y ~ x, size = 0.5) + 
  geom_line(data = data.frame("x" = x, "y" = y_true), aes(color = "True Line")) + 
  labs(title = "relationship between x - y", x = "x", y = "y")

4.1.3 Diagnoses

  1. Provide a residual plot with fitted \(\mathbf{y}\)-values on the x-axis and residuals on the y-axis.
data.frame("y" = fit$fitted.values, "res" = fit$residuals) %>%
  ggplot(aes(x = y, y = res)) +
  geom_point() + 
  labs(title = "scatter plot of fitted y - residuals", x = "fitted y", y = "residuals")

  2. Provide a normal QQ plot of the residuals.
qqnorm(fit$residuals)
qqline(fit$residuals)

  3. Comment on how well the model assumptions are met for the sample you used.

Student Answer: From i), we can see that the homoscedasticity assumption is well met by the data, and from ii), we can see that the normality assumption is well met by the data.

4.2 Understand sampling distribution and confidence intervals

This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.

Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.

lm_sim <- function(n_sim)
{
  # Initializing variables. Note b1, se, upper_ci, lower_ci are vectors
  x <- seq(0, 1, length = 40) 
  b1 <- 0                   # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
  se <- 0                   # standard error of each slope estimate. Initialize to 0 for now.
  upper_ci <- 0             # upper bound for beta_1. Initialize to 0 for now.
  lower_ci <- 0             # lower bound for beta_1. Initialize to 0 for now.
  t_star <- qt(0.975, 38)   # Food for thought: why 38 instead of 40? What is t_star?
  
  # Perform the simulation
  for (i in 1:n_sim){
    y <- 1 + 1.2 * x + rnorm(40, sd = 2.0)
    lse <- lm(y ~ x)
    lse_output <- summary(lse)$coefficients
    se[i] <- lse_output[2, 2]   # store each SE, not just the last one
    b1[i] <- lse_output[2, 1]
    upper_ci[i] <- b1[i] + t_star * se[i]
    lower_ci[i] <- b1[i] - t_star * se[i]
  }
  results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
  
  return(results)
}
  1. Summarize the LS estimates of \(\boldsymbol{\beta}_1\) (stored in results$b1). Does the sampling distribution agree with theory?

Student Answer: The sampling distribution agrees well with theory: \(\hat{\boldsymbol{\beta}}_1\) is normally distributed with mean around the true value \(1.2\) (i.e., unbiased) and variance around the theoretical value \(\frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 1.14\). (We increased n_sim to 10,000 to get more accurate numbers.)
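The theoretical variance quoted above can be computed directly from the fixed design vector, with \(\sigma = 2\) as in the simulation setup:

```r
# Var(beta1_hat) = sigma^2 / sum((x - mean(x))^2) for simple linear regression.
x <- seq(0, 1, length = 40)
sigma <- 2
sigma^2 / sum((x - mean(x))^2)  # approximately 1.14
```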

results <- lm_sim(10000)

ggplot(results) +
  geom_histogram(aes(x = b1), binwidth = 0.2)

qqnorm(results$b1)
qqline(results$b1)

mean(results$b1)
## [1] 1.22
var(results$b1)
## [1] 1.14
  2. How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.

Student Answer: Out of 100 simulations, the 95% CIs capture the true \(\boldsymbol{\beta}_1\) in 96 of them. (This count varies slightly from run to run due to sampling randomness; here we fixed the seed to 1 for reproducibility.)

set.seed(1)
results <- lm_sim(100)
sum(results$lower_ci < 1.2 & results$upper_ci >= 1.2)
## [1] 96
results <-
  arrange(results, lower_ci, upper_ci)
results$index = 1:100

ggplot(results) + 
  geom_pointrange(aes(x = index, y = b1, ymin = lower_ci, ymax = upper_ci)) + 
  geom_hline(yintercept=1.2, linetype="dashed", color = "red", size=1) + 
  labs(x = '# simulation (sorted)', y = 'b1 and 95% CI')